ggml : add ggml_scale_bias #14417
Conversation
I hope this won't have a significant impact on the performance.
Think it's a useful extension of the operator.
ggml/src/ggml-cpu/ops.cpp
Outdated
ggml_vec_scale_f32(nc, (float *) ((char *) dst->data + i1*nb1), s);
if (b != 0.0f) {
    ggml_vec_acc1_f32(nc, (float *) ((char *) dst->data + i1*nb1), b);
}
Merge these in ggml_vec_mad1_f32(). If you want, you can try to add a GGML_SIMD version using GGML_F32_VEC_FMA - it's quite simple. But you can also leave it as a basic for loop without SIMD.
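For reference, a minimal scalar sketch of the suggested helper, assuming it mutates y in place (per the PR's commit log, the final version also takes an x input parameter, computing y = x*s + b):

    // sketch: fused scale + bias in one pass, y[i] = y[i]*s + b
    inline static void ggml_vec_mad1_f32(const int n, float * y, const float s, const float b) {
        for (int i = 0; i < n; ++i) {
            y[i] = y[i]*s + b;
        }
    }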
Quick question: Is the "scale-bias" nomenclature more appropriate here than "multiply-add"? From an outsider perspective familiar with fused multiply-add ("MAD") operations, I didn't realize that "scale" meant "multiply" and "bias" meant "add" until I took a closer look.
"multiply-add" can be confused because we already had mad in the vec helpers (ggml_vec_mad_f32) with different semantics. So scale-bias it is - it also stays consistent with the existing ggml_scale op.
@ggerganov On second thought, I'm worried that extending the kernel with a bias term could have a performance impact.

Edit: see comment below
What is the concern with performance? Adding a constant is about as cheap as it gets.
Hmm, ok, maybe I'm just too concerned about the fact that cheap ops, if called repeatedly, can still make an impact. A quick search in llama.cpp reveals that this shouldn't matter much in practice. So I'll go back to the initial proposal of ggml_scale_bias.
ggml/src/ggml-cpu/vec.h
Outdated
vDSP_vsmul(y, 1, &s, y, 1, n);
vDSP_vsadd(y, 1, &b, y, 1, n);
There is vDSP_vmsa

Correction: there is vDSP_vsmsa, which is the right one here (scalar multiply and scalar add).
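For context, vDSP_vsmsa computes D[n] = A[n]*B + C with scalar B and C, so the two calls above collapse into a single pass over the data. A sketch assuming the same in-place y, s, b as in the diff:

    #include <Accelerate/Accelerate.h>

    // y[i] = y[i]*s + b in one fused Accelerate call
    vDSP_vsmsa(y, 1, &s, &b, y, 1, n);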
implemented in 563aca0
ggml/src/ggml-cpu/vec.h
Outdated
#if defined(__ARM_FEATURE_SVE)
    const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
    const int ggml_f32_epr = sve_register_length / 32; // SVE128:4, SVE256:8, SVE512:16
    const int ggml_f32_step = 2 * ggml_f32_epr;

    GGML_F32_VEC vs = GGML_F32_VEC_SET1(s);
    GGML_F32_VEC vb = GGML_F32_VEC_SET1(b);

    const int np = (n & ~(ggml_f32_step - 1));
    svfloat32_t ay1;
    svfloat32_t ay2;
    for (int i = 0; i < np; i += ggml_f32_step) {
        ay1 = GGML_F32_VEC_LOAD(y + i);
        ay1 = GGML_F32_VEC_FMA(ay1, vs, vb);
        GGML_F32_VEC_STORE(y + i, ay1);

        ay2 = GGML_F32_VEC_LOAD(y + i + 1*ggml_f32_epr);
        ay2 = GGML_F32_VEC_FMA(ay2, vs, vb);
        GGML_F32_VEC_STORE(y + i + 1*ggml_f32_epr, ay2);
    }
    // leftovers
    // maximum number of leftover elements will be less than ggml_f32_epr; apply predicated ops on available elements only
    if (np < n) {
        svbool_t pg = svwhilelt_b32(np, n);
        ay1 = svld1_f32(pg, y + np);
        ay1 = svmul_f32_m(pg, ay1, vs);
        ay1 = svadd_f32_m(pg, ay1, vb);
        svst1_f32(pg, y + np, ay1);
    }
#else
Remove this SVE implementation - we don't have hardware to test it yet.
done in 50c678f
The only backend that is not currently supported is CANN - could you tag the CANN contributors, @ggerganov? Also, it would be nice if we could launch the full CI to test CUDA and SYCL, but I'm not sure how to do that (and whether it's even possible when the PR is created from a forked repo).
Running this through ggml-ci would be nice. You can just push a tmp branch and check its results - no need to recreate the PR.
ggml/src/ggml-cuda/scale.cu
Outdated
memcpy(&scale, dst->op_params, sizeof(float));
memcpy(&bias, (float *) dst->op_params + 1, sizeof(float));
Make this more consistent:
memcpy(&scale, (float *) dst->op_params + 0, sizeof(float));
memcpy(&bias, (float *) dst->op_params + 1, sizeof(float));
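For illustration, a hedged sketch of the kind of fused kernel body these op_params feed into (the actual kernel in scale.cu may differ in naming and indexing details):

    // hypothetical kernel shape; scale and bias are read from dst->op_params as above
    static __global__ void scale_f32(const float * x, float * dst, const float scale, const float bias, const int k) {
        const int i = blockDim.x*blockIdx.x + threadIdx.x;
        if (i >= k) {
            return;
        }
        // fused scale + bias: dst[i] = x[i]*scale + bias
        dst[i] = scale * x[i] + bias;
    }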
ggml/src/ggml-cpu/vec.h
Outdated
#if defined(__ARM_FEATURE_SVE)
    // scalar ; TODO: Write SVE code
    for (int i = 0; i < n; ++i) {
        y[i] = y[i]*s + b;
    }
#else
GGML_F32_STEP doesn't seem to be defined on ARM SVE, so I leave the scalar impl here.
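For reference, a sketch of what the generic branch after the #else could look like, following the pattern of the existing GGML_SIMD helpers in vec.h (macro usage assumed from helpers like ggml_vec_scale_f32; in ggml's convention GGML_F32_VEC_FMA(a, b, c) computes a + b*c):

    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC vs = GGML_F32_VEC_SET1(s);
    GGML_F32_VEC vb = GGML_F32_VEC_SET1(b);

    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_FMA(vb, ay[j], vs); // vb + ay[j]*vs = y*s + b
            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] = y[i]*s + b;
    }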
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's ok for now. I'm having some doubts about these SVE branches - might end up removing them altogether.
I ran the ggml-ci and it passed: b7c6ece. Will merge this once the CI of this PR is green.
* origin/master:
  llama : support Jamba hybrid Transformer-Mamba models (ggml-org#7531)
  ggml : add ggml_scale_bias (ggml-org#14417)
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
Ref discussion: #14400 (comment)
Added ggml_scale_bias(ctx, a, s, b) in this PR, which allows calculating x = a * s + b

I only added the Metal kernel for now, just for discussion. @ggerganov does this look good to you?
TODO: support other backends
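A hedged usage sketch of the new API (the tensor setup is illustrative; only the ggml_scale_bias call and its signature come from this PR):

    // build x = a*2.0f + 0.5f as a single fused node,
    // instead of a separate scale followed by an add
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);
    struct ggml_tensor * x = ggml_scale_bias(ctx, a, 2.0f, 0.5f);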